For my R final project, I look at some data from this year’s NBA season.
library(plotly)
library(readxl)
library(d3heatmap)
library(viridis)
library(tidyverse)
library(ggfortify)
NBA_Player_Stats_master <- read_excel("C:/Users/mog2/Downloads/2019-2020 NBA Player Stats NBAstuffer.xlsx", skip = 1)
head(NBA_Player_Stats_master)
## # A tibble: 6 x 29
## RANK `FULL NAME` TEAM POS AGE GP MPG `MIN%Minutes Pe~
## <lgl> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 NA Steven Ada~ Okc C 26.4 20 27.1 56.4
## 2 NA Bam Adebayo Mia C-F 22.4 24 33.5 69.8
## 3 NA LaMarcus A~ San F-C 34.4 21 32.9 68.6
## 4 NA Nickeil Al~ Nor G 21.3 20 14 29.2
## 5 NA Grayson Al~ Mem G 24.2 13 18.9 39.5
## 6 NA Jarrett Al~ Bro C 21.6 23 26.5 55.2
## # ... with 21 more variables: `USG%Usage RateUsage rate, a.k.a., usage
## # percentage is an estimate of the percentage of team plays used by a
## # player while he was on the floor` <dbl>, `TO%Turnover RateA metric
## # that estimates the number of turnovers a player commits per 100
## # possessions` <dbl>, FTA <dbl>, `FT%` <dbl>, `2PA` <dbl>, `2P%` <dbl>,
## # `3PA` <dbl>, `3P%` <dbl>, `eFG%Effective Shooting PercentageWith eFG%,
## # three-point shots made are worth 50% more than two-point shots made.
## # eFG% Formula=(FGM+ (0.5 x 3PM))/FGA` <dbl>, `TS%True Shooting
## # PercentageTrue shooting percentage is a measure of shooting efficiency
## # that takes into account field goals, 3-point field goals, and free
## # throws.` <dbl>, `PPGPointsPoints per game.` <dbl>,
## # `RPGReboundsRebounds per game.` <dbl>, `TRB%Total Rebound
## # PercentageTotal rebound percentage is estimated percentage of
## # available rebounds grabbed by the player while the player is on the
## # court.` <dbl>, `APGAssistsAssists per game.` <dbl>, `AST%Assist
## # PercentageAssist percentage is an estimated percentage of teammate
## # field goals a player assisted while the player is on the court` <dbl>,
## # `SPGStealsSteals per game.` <dbl>, `BPGBlocksBlocks per game.` <dbl>,
## # `TOPGTurnoversTurnovers per game.` <dbl>, `VIVersatility
## # IndexVersatility index is a metric that measures a player’s ability to
## # produce in points, assists, and rebounds. The average player will
## # score around a five on the index, while top players score above
## # 10` <dbl>, `ORTGOffensive RatingIndividual offensive rating is the
## # number of points produced by a player per 100 total individual
## # possessions.` <dbl>, `DRTGDefensive RatingIndividual defensive rating
## # estimates how many points the player allowed per 100 possessions he
## # individually faced while staying on the court.` <dbl>
To tidy the data, I select the columns of information that I want and also change the column names.
NBA_Player_Stats <- NBA_Player_Stats_master %>%
select(c(2:7,11:16,19,20,22,24:26))
colnames(NBA_Player_Stats)[13:18] <- c("PPG","RPG","APG","SPG","BPG","TOPG")
Currently, the table is arranged with the players and their respective teams, so I wanted to look at the range of points scored by players on each team. To do this, I used ggplot with boxplots and faceted the data by teams.
p <- ggplot(NBA_Player_Stats, aes(x = TEAM, y = `PPG`, fill = TEAM)) + geom_boxplot() + scale_fill_viridis(discrete = TRUE, alpha=0.6) +
geom_jitter(color="black", size=0.4, alpha=0.9)
ggplotly(p)
I grouped the data by TEAM and added an additional column labeled as total_ppg. I then imported another data table and joined them by team name to get win percentage.
Winning <- NBA_Player_Stats
Winning <- Winning %>%
group_by(TEAM) %>%
mutate(total_ppg = sum(PPG))
Standings <- read_excel("C:/Users/mog2/Downloads/2019-2020 NBA Standings.xlsx")
Winning <- left_join(Winning,Standings, by = 'TEAM')
p5 <- plot_ly(Winning, x = ~total_ppg, y = ~`Win%`, type = 'scatter', mode = 'markers',
marker = list(size = 5, opacity = 0.5),
hoverinfo = 'text',
text = ~paste('Team:', `TEAM`, '<br> Win%:', `Win%`, '<br> Total average PPG:', total_ppg)) %>%
layout(title = 'Total average PPG versus win percentage')
p5
WinningLinear <- lm(total_ppg ~ `Win%`, data = Winning)
broom::tidy(WinningLinear)
## # A tibble: 2 x 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 135. 1.35 99.6 8.02e-315
## 2 `Win%` 2.06 2.51 0.822 4.11e- 1
broom::glance(WinningLinear)
## # A tibble: 1 x 11
## r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC
## <dbl> <dbl> <dbl> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1 0.00146 -0.000699 10.8 0.676 0.411 2 -1765. 3536. 3548.
## # ... with 2 more variables: deviance <dbl>, df.residual <int>
autoplot(WinningLinear)
I wanted to look at some of the stats of the top 50 scoring NBA players and see how their other stats compared.
Top_20 <- NBA_Player_Stats %>%
arrange(desc(PPG))
Top_20 <- Top_20[1:20,]
Top_20_d3heatmap <- Top_20 %>%
select('FULL NAME',13:18)
row.names(Top_20_d3heatmap) <- Top_20_d3heatmap$`FULL NAME`
## Warning: Setting row names on a tibble is deprecated.
Top_20_d3heatmap$`FULL NAME` <- NULL
p2 <- d3heatmap(scale(Top_20_d3heatmap), dendrogram = 'none', color = "Blues")
p2
I was then curious to see if a player’s shooting percentage (2P%, 3P%, FT%) had any correlation with a player’s PPG. First, I filtered the initial dataset to only include players that had shot above the average number of shots. Then I graphed percentage versus PPG. The dot size and color represents the overall number of shots taken.
TwoPA <- NBA_Player_Stats %>%
filter(`2PA` > mean(`2PA`))
ThreePA <- NBA_Player_Stats %>%
filter(`3PA` > mean(`3PA`))
FTA <- NBA_Player_Stats %>%
filter(`FTA` > mean(`FTA`))
p5 <- plot_ly(TwoPA, x = ~`2P%`, y = ~PPG, type = 'scatter', mode = 'markers', color = ~`2PA`, colors = 'Blues',
marker = list(size = ~`2PA`/10, opacity = 0.5),
hoverinfo = 'text',
text = ~paste('Player:', `FULL NAME`, '<br> PPG:', PPG, '<br> 2PA:', `2PA`, '<br> 2P%:', `2P%`)) %>%
layout(title = '2 pointer percentage versus PPG')
p5
p6 <- plot_ly(ThreePA, x = ~`3P%`, y = ~PPG, type = 'scatter', mode = 'markers', color = ~`3PA`, colors = 'Reds',
marker = list(size = ~`3PA`/10, opacity = 0.5),
hoverinfo = 'text',
text = ~paste('Player:', `FULL NAME`, '<br> PPG:', PPG, '<br> 3PA:', `3PA`, '<br> 3P%:', `3P%`)) %>%
layout(title = '3 pointer percentage versus PPG')
p6
p7 <- plot_ly(FTA, x = ~`FT%`, y = ~PPG, type = 'scatter', mode = 'markers', color = ~`FTA`, colors = 'Greens',
marker = list(size = ~`FTA`/10, opacity = 0.5),
hoverinfo = 'text',
text = ~paste('Player:', `FULL NAME`, '<br> PPG:', PPG, '<br> FTA:', `FTA`, '<br> FT%:', `FT%`)) %>%
layout(title = 'Free throw percentage versus PPG')
p7
TwoPALinear <- lm(`2P%` ~ PPG, data = TwoPA)
broom::tidy(TwoPALinear)
## # A tibble: 2 x 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 0.523 0.0148 35.3 2.09e-84
## 2 PPG -0.000497 0.000953 -0.522 6.02e- 1
broom::glance(TwoPALinear)
## # A tibble: 1 x 11
## r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC
## <dbl> <dbl> <dbl> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1 0.00146 -0.00391 0.0779 0.273 0.602 2 214. -422. -412.
## # ... with 2 more variables: deviance <dbl>, df.residual <int>
autoplot(TwoPALinear)
ThreePALinear <- lm(`3P%` ~ `PPG`, data = ThreePA)
broom::tidy(ThreePALinear)
## # A tibble: 2 x 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 0.360 0.00895 40.2 2.46e-94
## 2 PPG 0.0000955 0.000607 0.157 8.75e- 1
broom::glance(ThreePALinear)
## # A tibble: 1 x 11
## r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC
## <dbl> <dbl> <dbl> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1 0.000132 -0.00519 0.0533 0.0247 0.875 2 288. -571. -561.
## # ... with 2 more variables: deviance <dbl>, df.residual <int>
autoplot(ThreePALinear)
FTALinear <- lm(`FT%` ~ `PPG`, data = FTA)
broom::tidy(FTALinear)
## # A tibble: 2 x 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 0.682 0.0216 31.6 3.31e-70
## 2 PPG 0.00604 0.00132 4.58 9.41e- 6
broom::glance(FTALinear)
## # A tibble: 1 x 11
## r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC
## <dbl> <dbl> <dbl> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1 0.117 0.112 0.0988 21.0 9.41e-6 2 144. -283. -274.
## # ... with 2 more variables: deviance <dbl>, df.residual <int>
autoplot(FTALinear)